Igemm experiment #43

SuperFluffy · 2018-12-09T11:14:04Z

This is a first shot at implementing gemm using i8 as input, and i16 as output.

The main change is to the gemm loops, which are now able to have different types as input and output.

Todo:

Adapt integration tests for i8gemm;
Fix the masked buffers to work with different kernel dimensions;

SuperFluffy · 2018-12-09T13:05:51Z

@bluss: I have adjusted gemm_packed to work with kernels of shapes other then 8x8 and 8x4. However, Rust seems to have issues finding the correct associated types and constants of the K: GemmKernel, and I can't figure out, why. The suggested fixes are basically what's already written.

Can you have a look?

 % cargo check

    Checking matrixmultiply v0.2.1 (/home/janis/foreign_git/matrixmultiply)
error[E0599]: no associated item named `MR` found for type `K` in the current scope
   --> src/gemm.rs:213:30
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                              ^^^^^ associated item not found in `K`
    |
    = help: items from traits can only be used if the trait is implemented and in scope
    = note: the following trait defines an item `MR`, perhaps you need to implement it:
            candidate #1: `kernel::GemmKernel`

error[E0599]: no associated item named `NR` found for type `K` in the current scope
   --> src/gemm.rs:213:38
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                                      ^^^^^ associated item not found in `K`
    |
    = help: items from traits can only be used if the trait is implemented and in scope
    = note: the following trait defines an item `NR`, perhaps you need to implement it:
            candidate #1: `kernel::GemmKernel`

error[E0220]: associated type `ElemIn` not found for `K`
   --> src/gemm.rs:213:56
    |
213 |     let mut mask_buf = [0u8; K::MR * K::NR * size_of::<K::ElemIn>() + 31];
    |                                                        ^^^^^^^^^ associated type `ElemIn` not found

error: aborting due to 3 previous errors

Yet just a few lines above I clearly have K: GemmKernel... The extra bounds on the associated types ElemIn and ElemOut are probably redundant. I just added them to help the compiler.

bluss · 2018-12-09T13:46:26Z

@SuperFluffy don't have time to look at it all, but associated constants can not be used in array sizes. Rust can't do that.

SuperFluffy · 2018-12-09T15:21:44Z

@bluss Thanks for the note, was looking through RFCs. That's another use for const generics, I guess.

I wonder if replacing the buffer by a Vec would be very bad. Will test it.

SuperFluffy · 2018-12-09T20:24:50Z

Wow, for small matrices using a Vec leads to some serious slowdowns:

 name                 buf_mask ns/iter  vec_mask ns/iter  diff ns/iter  diff %  speedup 
 layout_f64_032::ccc  2,187             2,244                       57   2.61%   x 0.97 
 layout_f64_032::ccf  2,150             2,174                       24   1.12%   x 0.99 
 layout_f64_032::cfc  2,371             2,418                       47   1.98%   x 0.98 
 layout_f64_032::cff  2,334             2,361                       27   1.16%   x 0.99 
 layout_f64_032::fcc  2,005             2,048                       43   2.14%   x 0.98 
 layout_f64_032::fcf  1,972             1,991                       19   0.96%   x 0.99 
 layout_f64_032::ffc  2,173             2,216                       43   1.98%   x 0.98 
 layout_f64_032::fff  2,114             2,150                       36   1.70%   x 0.98 
 mat_mul_f64::m004    143               187                         44  30.77%   x 0.76 
 mat_mul_f64::m006    216               249                         33  15.28%   x 0.87 
 mat_mul_f64::m008    225               262                         37  16.44%   x 0.86 
 mat_mul_f64::m012    410               423                         13   3.17%   x 0.97 
 mat_mul_f64::m016    480               517                         37   7.71%   x 0.93 
 mat_mul_f64::m032    2,191             2,215                       24   1.10%   x 0.99 
 mat_mul_f64::m064    14,075            13,805                    -270  -1.92%   x 1.02 
 mat_mul_f64::m127    97,784            98,549                     765   0.78%   x 0.99

SuperFluffy · 2018-12-10T12:53:07Z

Blowing up the masked buffer to 1024 (16322, kernel is 16x32, i16 takes 2 bytes) elements at least doesn't seem to affect performance:

running 16 tests
test layout_f64_032::ccc ... bench:       2,189 ns/iter (+/- 79)
test layout_f64_032::ccf ... bench:       2,141 ns/iter (+/- 55)
test layout_f64_032::cfc ... bench:       2,373 ns/iter (+/- 72)
test layout_f64_032::cff ... bench:       2,313 ns/iter (+/- 83)
test layout_f64_032::fcc ... bench:       2,015 ns/iter (+/- 60)
test layout_f64_032::fcf ... bench:       1,960 ns/iter (+/- 49)
test layout_f64_032::ffc ... bench:       2,185 ns/iter (+/- 69)
test layout_f64_032::fff ... bench:       2,115 ns/iter (+/- 57)
test mat_mul_f64::m004   ... bench:         154 ns/iter (+/- 6)
test mat_mul_f64::m006   ... bench:         228 ns/iter (+/- 1)
test mat_mul_f64::m008   ... bench:         231 ns/iter (+/- 6)
test mat_mul_f64::m012   ... bench:         419 ns/iter (+/- 13)
test mat_mul_f64::m016   ... bench:         475 ns/iter (+/- 21)
test mat_mul_f64::m032   ... bench:       2,213 ns/iter (+/- 59)
test mat_mul_f64::m064   ... bench:      14,128 ns/iter (+/- 339)
test mat_mul_f64::m127   ... bench:      98,621 ns/iter (+/- 2,590)

It makes me unhappy to do that, but maybe it's ok?

The only other option I see is to make a macro for each *gemm that passes in a buffer size. But that still doesn't solve that ideally we'd have a different buffer size for each kernel type.

SuperFluffy · 2018-12-10T14:59:57Z

This certainly needs more tuning. This is some terrible performance, as of now:

cargo bench i8

running 18 tests
test layout_i8_128::ccc  ... bench:      86,794 ns/iter (+/- 3,997)
test layout_i8_128::ccf  ... bench:      88,833 ns/iter (+/- 4,850)
test layout_i8_128::cfc  ... bench:      93,464 ns/iter (+/- 5,792)
test layout_i8_128::cff  ... bench:      82,951 ns/iter (+/- 2,452)
test layout_i8_128::fcc  ... bench:      75,052 ns/iter (+/- 11,858)
test layout_i8_128::fcf  ... bench:      75,082 ns/iter (+/- 14,399)
test layout_i8_128::ffc  ... bench:      78,170 ns/iter (+/- 12,755)
test layout_i8_128::fff  ... bench:      79,103 ns/iter (+/- 11,437)
test mat_mul_i8::m004    ... bench:         338 ns/iter (+/- 19)
test mat_mul_i8::m006    ... bench:         473 ns/iter (+/- 64)
test mat_mul_i8::m008    ... bench:         591 ns/iter (+/- 50)
test mat_mul_i8::m012    ... bench:         887 ns/iter (+/- 54)
test mat_mul_i8::m016    ... bench:       1,161 ns/iter (+/- 262)
test mat_mul_i8::m032    ... bench:       2,889 ns/iter (+/- 639)
test mat_mul_i8::m064    ... bench:      14,749 ns/iter (+/- 3,077)
test mat_mul_i8::m127    ... bench:      93,135 ns/iter (+/- 11,694)
test mat_mul_i8::m128    ... bench:      95,444 ns/iter (+/- 14,300)
test mat_mul_i8::m255    ... bench:     607,507 ns/iter (+/- 85,093)

SuperFluffy · 2018-12-10T15:37:25Z

The reason for those atrocious numbers is that I probably don't have the number of available vector, ymm*, registers in mind. avx has 16 vector registers in total.

This issue becomes easily obvious by looking at the kernel size I have chosen: in the kernel, I accumulate in an array of size 32. That means that for the accumulation variables alone I need 32 registers. Add on top the column from a and each element in the row from b, and the accumulation step within the unrolled loop needs 65 registers.

The benchmarks above thus most probably measure the CPU loading data in and out of memory.

bluss · 2018-12-11T20:12:43Z

Sounds interesting. :)

I'd propose some targets:

rebase/focus on merging with master
separate out refactorings that are unrelated to the actual feature
use a new crate feature for this. It can be default during development and for docs.rs, but probably not default for standard build

SuperFluffy · 2018-12-11T22:47:18Z

The way forward to implement an efficient integer gemm is to change the packing of the matrices that the kernel accesses. Google's gemmlowp is doing exactly that. An example can be seen here: https://github.com/google/gemmlowp/blob/master/internal/kernel.h#L116-L119

0  1  2  3  4  20 21 22 23 24
5  6  7  8  9  25 26 27 28 29
10 11 12 13 14 30 31 32 33 34
15 16 17 18 19 35 36 37 38 39

Here, two sub blocks (or “cell” in their terminology) of dimensions 4x5 each are placed right next to each other. Each cell is row major in itself.

For an avx2 kernel performing matrix multiplication between two i8 matrices, an optimal kernel would look something like this

    b 0 1 2 3
 a    4 5 6 7
0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

0 8
1 9
2 a
3 b
4 c
5 d
6 e
7 f

We can load 16 i8 and sign-extend them to 16 i16 into one register. This takes up ymm0 and ymm1 for a and b. The problem is that the outer product between one column, 16x1, and one row, 1x16, requires 16 registers to accumulate their product. We only have 16 registers in total, which is already more than we have available.

An alternative is to consider multiplying 8x2 and 2x8 (two registers total). The intrinsic vpmaddwd allows doing the multiplication and addition in one go, accumulating the result in 32-bit vectors. We'd need 8 of those, so we have now used 10, leaving us with 6. That's too many spare ones (we don't need so many to do the intermediate steps like shuffling, etc). Using two columns, so having a 16x2 panel), would again exceed the number of available registers.

The best option

I think the optimal solution is to do what gemmlowp is doing: multiply 3 8x2 cells of matrix a and one 2x4 cell matrix b. This way, we can accumulate in 3*4 = 12 registers, loading new data from a three times. This takes 14 registers for accumulation and cells, leaving 2 registers to perform intermediate operations in. The main task is hence to pack 8x2 cells such that they are adjacent to each other, and pack 2x4 cells such that they have additional padding of 8 bytes.

@bluss How much of the loops do you think would need to be adapted to permit such packing?

SuperFluffy added 5 commits December 7, 2018 18:17

WIP

5e22539

Generalize gemm to allow different input and output arrays

0c19f42

Remove kernel for i32, implement i8

86d1f5c

First shot at generalizing and fixing the integration tests

1ae7f38

Make gemm loop work with different kernel dimensions

26e6c82

Increase the buffer size to accomodate very large kernels

49ad58a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Igemm experiment #43

Igemm experiment #43

SuperFluffy commented Dec 9, 2018 •

edited

Loading

SuperFluffy commented Dec 9, 2018 •

edited

Loading

bluss commented Dec 9, 2018 •

edited

Loading

SuperFluffy commented Dec 9, 2018 •

edited

Loading

SuperFluffy commented Dec 9, 2018

SuperFluffy commented Dec 10, 2018

SuperFluffy commented Dec 10, 2018

SuperFluffy commented Dec 10, 2018

bluss commented Dec 11, 2018

SuperFluffy commented Dec 11, 2018

Igemm experiment #43

Are you sure you want to change the base?

Igemm experiment #43

Conversation

SuperFluffy commented Dec 9, 2018 • edited Loading

Todo:

SuperFluffy commented Dec 9, 2018 • edited Loading

bluss commented Dec 9, 2018 • edited Loading

SuperFluffy commented Dec 9, 2018 • edited Loading

SuperFluffy commented Dec 9, 2018

SuperFluffy commented Dec 10, 2018

SuperFluffy commented Dec 10, 2018

SuperFluffy commented Dec 10, 2018

bluss commented Dec 11, 2018

SuperFluffy commented Dec 11, 2018

The best option

SuperFluffy commented Dec 9, 2018 •

edited

Loading

SuperFluffy commented Dec 9, 2018 •

edited

Loading

bluss commented Dec 9, 2018 •

edited

Loading

SuperFluffy commented Dec 9, 2018 •

edited

Loading